Project: Autshumato III

Type: Aligned parallel corpus
Languages: English (en_GB, eng_GB) & Setswana (tn_ZA, tsn_ZA)
Date: 2016-05-04
Version: 1.0.0 (Final)

Description: 
Aligned English-Setswana parallel corpora.
The bilingual data is divided into three separate sets:
	SET 1: Corpus.DACB3.BilingualData_Translated
	SET 2: Corpus.DACB3.BilingualData_ReliableSources
	SET 3: Corpus.DACB3.BilingualData_Other

SET 1: This set contains data translated from English into Setswana by professional translators. 
File name: Corpus.DACB3.BilingualData_Translated.1.0.0
Lines: 31 376
English Words: 277 283
Setswana Words: 324 342

SET 2: This set contains data that was sourced as translated file pairs from translators.
File name: Corpus.DACB3.BilingualData_ReliableSources.1.0.0
Lines: 54 431
English Words: 869 016
Setswana Words: 1 099 509

SET 3: This set contains all the data that does not fall into either of the above categories, including data crawled from various government websites.
File name: Corpus.DACB3.BilingualData_Other.1.0.0
Lines: 73 193
English Words: 890 874
Setswana Words: 1 172 172
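
Combined totals: the per-set figures above can be summed to obtain the size of the full corpus. The following is a minimal sketch in Python, not part of the distributed data. The counts are copied from the three set entries above; the dictionary layout and the names SETS, to_int, totals and word_ratio are illustrative assumptions.

```python
# Counts copied verbatim from the SET 1-3 entries of this README.
# The dictionary layout and key names are illustrative, not part of the corpus.
SETS = {
    "Translated": {"lines": "31 376", "en_words": "277 283", "tn_words": "324 342"},
    "ReliableSources": {"lines": "54 431", "en_words": "869 016", "tn_words": "1 099 509"},
    "Other": {"lines": "73 193", "en_words": "890 874", "tn_words": "1 172 172"},
}


def to_int(grouped):
    """Convert a space-grouped figure such as '31 376' to an int."""
    return int(grouped.replace(" ", ""))


def totals(sets):
    """Sum lines and word counts over all sets."""
    out = {"lines": 0, "en_words": 0, "tn_words": 0}
    for counts in sets.values():
        for key in out:
            out[key] += to_int(counts[key])
    return out


def word_ratio(counts):
    """Setswana words per English word for one set."""
    return to_int(counts["tn_words"]) / to_int(counts["en_words"])


if __name__ == "__main__":
    for name, counts in SETS.items():
        print(f"{name}: {word_ratio(counts):.2f} Setswana words per English word")
    print(totals(SETS))
```

With the figures listed above, this gives 159 000 aligned lines, 2 037 173 English words and 2 596 023 Setswana words in total.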

Source(s): 
Various sources, predominantly government domain.

Project website: http://autshumato.sourceforge.net/
_________________________________________________________________________________
Licence: Creative Commons Attribution 2.5 South Africa
 
URL: http://creativecommons.org/licenses/by/2.5/za/
 
Attribute work to: 
	CTexT (Centre for Text Technology, North-West University), South Africa; 
	Department of Arts and Culture, South Africa.
Attribute work to URL:	
	http://www.nwu.ac.za/ctext and 
	http://www.dac.gov.za/

